Team member:

Introduction

The data set includes data from a direct marketer who sells his products only via direct mail. He sends catalogs with product characteristics to customers who then order directly from the catalogs. The marketer has developed customer records to learn what makes some customers spend more than others.

The objective of this predictive modeling exercise is to predict the amount that will be spent in terms of the provided customer characteristics for a direct marketer who sells his products via mail. This analysis will be useful for the marketer to make strategic decisions about advertising and targetting a selected group of potential customers based on the amount that they are predicted to spend in the future.

The dataset DirectMarketing.csv contains 1000 records and 18 attributes. But only the first 10 columns are significant, 8 remaining columns are just one-hot encoded attributes from the first 10 row. As one-hot encoding is only used for linear regression part, so we decided to remove 8 last rows in Exploratory Data Analysis part for better speed.

Table of contents

0. Introduction
1. Importing Libraries
2. Data
3. Exploratory Data Analysis
- 3.1 Missing Value
- 3.2 Correlation Matrix
- 3.3 Describe Function

4. Data Analysis
- 4.1 Age
- 4.2 Gender
- 4.3 Own Home
- 4.4 Married
- 4.5 Location
- 4.6 Children
- 4.7 History
- 4.8 Catalogs
- 4.9 Amount Spent
- 4.10 Salary

5. Data Summary
6. Some regressions
7. Conclusion

1. Importing Libraries

# Install Tidyverse for data manipulation and visualization
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install MASS for statistical functions
install.packages("MASS")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install glmnet for Lasso and Ridge regression
install.packages("glmnet")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install boot for bootstrapping methods
install.packages("boot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install locfit for local regression
install.packages("locfit")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install caret for machine learning modeling
install.packages("caret")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install corrplot for correlation plot visualization
install.packages("corrplot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install ggplot2 for advanced data visualization
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install reshape2 for data reshaping
install.packages("reshape2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install dplyr for data manipulation
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install plotly for interactive plots
install.packages("plotly")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install gridExtra for customizing plot layouts
install.packages("gridExtra")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages("Hmisc")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
library(boot)
library(locfit)
## locfit 1.5-9.8    2023-06-11
## 
## Attaching package: 'locfit'
## 
## The following object is masked from 'package:purrr':
## 
##     none
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## 
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(corrplot)
## corrplot 0.92 loaded
library(glmnet)
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(dplyr)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

2. Data

Load the data and look at its dimansions.

d_dimar <- read.csv('DirectMarketing.csv')
dim(d_dimar)
## [1] 1000   18

Look at the first few rows to get to know the contents.

head(d_dimar)
##      Age Gender OwnHome Married Location Salary Children History Catalogs
## 1    Old Female     Own  Single      Far  47500        0    High        6
## 2 Middle   Male    Rent  Single    Close  63600        0    High        6
## 3  Young Female    Rent  Single    Close  13500        0     Low       18
## 4 Middle   Male     Own Married    Close  85600        1    High       18
## 5 Middle Female     Own  Single    Close  68400        0    High       12
## 6  Young   Male     Own Married    Close  30400        0     Low        6
##   AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
## 1         755        1         0          0         1     0     0      0      1
## 2        1318        0         0          1         0     0     1      0      1
## 3         296        1         0          1         0     1     0      0      0
## 4        2436        0         1          1         1     0     1      0      1
## 5        1304        1         0          1         1     0     1      0      1
## 6         495        0         1          1         1     1     0      0      0

Now we gonna take only first 10 columns for data analysis

d_mar <- d_dimar[, 1:10]
dim(d_mar)
## [1] 1000   10
head(d_mar)
##      Age Gender OwnHome Married Location Salary Children History Catalogs
## 1    Old Female     Own  Single      Far  47500        0    High        6
## 2 Middle   Male    Rent  Single    Close  63600        0    High        6
## 3  Young Female    Rent  Single    Close  13500        0     Low       18
## 4 Middle   Male     Own Married    Close  85600        1    High       18
## 5 Middle Female     Own  Single    Close  68400        0    High       12
## 6  Young   Male     Own Married    Close  30400        0     Low        6
##   AmountSpent
## 1         755
## 2        1318
## 3         296
## 4        2436
## 5        1304
## 6         495

We have 3 numerical features and 6 categorical features. Now we move to data analysis part.

# 3.Exploratory Data Analysis * Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.

Let’s start exploring our data

duplicates <- d_mar[duplicated(d_mar), ]
duplicates
##  [1] Age         Gender      OwnHome     Married     Location    Salary     
##  [7] Children    History     Catalogs    AmountSpent
## <0 rows> (or 0-length row.names)

We don’t have duplicated data

3.1 Missing Value

any(is.na(d_mar))
## [1] TRUE
colSums(is.na(d_mar))
##         Age      Gender     OwnHome     Married    Location      Salary 
##           0           0           0           0           0           0 
##    Children     History    Catalogs AmountSpent 
##           0         303           0           0

In this step, we add another level to the History columns and replace the missing values with ‘Never’ to represent the customers who have not yet purchased and print out the levels.

levs <- levels(d_mar$History)
levs[length(levs)+1] <- "Never"
d_mar$History <- factor(d_mar$History, levels=levs)
d_mar$History[is.na(d_mar$History)] <- "Never"
d_mar$History <- ordered(d_mar$History, levels=c("Never", "Low", "Medium", "High"))
print('After replacing missing values NA with Never:')
## [1] "After replacing missing values NA with Never:"
levels(d_mar$History)
## [1] "Never"  "Low"    "Medium" "High"

Look at the distribution of customers according to History

table(d_mar$History)
## 
##  Never    Low Medium   High 
##   1000      0      0      0

3.2 Correlation Matrix

  • A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables.
numeric_columns <- sapply(d_dimar, is.numeric)
corr_matrix <- cor(d_dimar[, numeric_columns], use = "complete.obs")
corr_matrix
##                  Salary     Children    Catalogs AmountSpent     Gender_b
## Salary       1.00000000  0.049663163  0.18355086   0.6995957 -0.261492181
## Children     0.04966316  1.000000000 -0.11345543  -0.2223082  0.105469083
## Catalogs     0.18355086 -0.113455428  1.00000000   0.4726499 -0.087350767
## AmountSpent  0.69959571 -0.222308170  0.47264989   1.0000000 -0.201690213
## Gender_b    -0.26149218  0.105469083 -0.08735077  -0.2016902  1.000000000
## Married_b    0.67563308  0.009770249  0.13705989   0.4758800 -0.116057285
## Location_b   0.03712709 -0.002391455 -0.12858075  -0.2526157 -0.005553971
## Ownhome_b    0.46073640 -0.032274083  0.09313151   0.3508080 -0.084433317
## Age_y       -0.58857078  0.073527118 -0.15887218  -0.4346918  0.113978982
## Age_m        0.52905162  0.244719648  0.11408342   0.3013953 -0.204232847
## Hist_m      -0.01237083 -0.041939180  0.01604679  -0.1438307 -0.025799164
## Hist_h       0.52469002 -0.273361609  0.28493154   0.5903957 -0.160750813
##                Married_b   Location_b    Ownhome_b       Age_y       Age_m
## Salary       0.675633080  0.037127094  0.460736395 -0.58857078  0.52905162
## Children     0.009770249 -0.002391455 -0.032274083  0.07352712  0.24471965
## Catalogs     0.137059886 -0.128580754  0.093131508 -0.15887218  0.11408342
## AmountSpent  0.475879979 -0.252615659  0.350807999 -0.43469185  0.30139535
## Gender_b    -0.116057285 -0.005553971 -0.084433317  0.11397898 -0.20423285
## Married_b    1.000000000  0.006964058  0.264009318 -0.28328892  0.15595721
## Location_b   0.006964058  1.000000000  0.033691291 -0.03298183  0.04108406
## Ownhome_b    0.264009318  0.033691291  1.000000000 -0.46929874  0.25164907
## Age_y       -0.283288923 -0.032981833 -0.469298744  1.00000000 -0.64468192
## Age_m        0.155957211  0.041084058  0.251649074 -0.64468192  1.00000000
## Hist_m       0.027285078  0.088858299 -0.001919144 -0.13977934  0.09447173
## Hist_h       0.353280416 -0.207556702  0.277386271 -0.33568013  0.17191158
##                   Hist_m     Hist_h
## Salary      -0.012370830  0.5246900
## Children    -0.041939180 -0.2733616
## Catalogs     0.016046789  0.2849315
## AmountSpent -0.143830654  0.5903957
## Gender_b    -0.025799164 -0.1607508
## Married_b    0.027285078  0.3532804
## Location_b   0.088858299 -0.2075567
## Ownhome_b   -0.001919144  0.2773863
## Age_y       -0.139779338 -0.3356801
## Age_m        0.094471729  0.1719116
## Hist_m       1.000000000 -0.3034567
## Hist_h      -0.303456731  1.0000000
# Melt the correlation matrix for ggplot
melted_corr_matrix <- melt(corr_matrix)

# Create the heatmap plot and assign it to a variable
heatmap_plot <- ggplot(data = melted_corr_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
        axis.text.y = element_text(size = 12)) +
  labs(x = '', y = '', title = 'Correlation Matrix') +
  geom_text(aes(label = sprintf("%.2f", value)), vjust = 1, size = 3)

# Print the plot with larger dimensions
print(heatmap_plot)

# Save the plot with larger dimensions
ggsave("heatmap_plot.png", plot = heatmap_plot, width = 10, height = 8, dpi = 300)

3.3 Describe Function

  • Generate descriptive statistics.
  • This function returns the count, mean, standard deviation,
  • minimum, maximum values and the quantiles of the data.
summary(d_mar)
##      Age               Gender            OwnHome            Married         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Location             Salary          Children       History    
##  Length:1000        Min.   : 10100   Min.   :0.000   Never :1000  
##  Class :character   1st Qu.: 29975   1st Qu.:0.000   Low   :   0  
##  Mode  :character   Median : 53700   Median :1.000   Medium:   0  
##                     Mean   : 56104   Mean   :0.934   High  :   0  
##                     3rd Qu.: 77025   3rd Qu.:2.000                
##                     Max.   :168800   Max.   :3.000                
##     Catalogs      AmountSpent    
##  Min.   : 6.00   Min.   :  38.0  
##  1st Qu.: 6.00   1st Qu.: 488.2  
##  Median :12.00   Median : 962.0  
##  Mean   :14.68   Mean   :1216.8  
##  3rd Qu.:18.00   3rd Qu.:1688.5  
##  Max.   :24.00   Max.   :6217.0

Take a look at the structure of our data frame, looks good for a linear regression model.

str(d_mar)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Age        : chr  "Old" "Middle" "Young" "Middle" ...
##  $ Gender     : chr  "Female" "Male" "Female" "Male" ...
##  $ OwnHome    : chr  "Own" "Rent" "Rent" "Own" ...
##  $ Married    : chr  "Single" "Single" "Single" "Married" ...
##  $ Location   : chr  "Far" "Close" "Close" "Close" ...
##  $ Salary     : int  47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
##  $ Children   : int  0 0 0 1 0 0 0 0 3 0 ...
##  $ History    : Ord.factor w/ 4 levels "Never"<"Low"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Catalogs   : int  6 6 18 18 12 6 12 18 6 18 ...
##  $ AmountSpent: int  755 1318 296 2436 1304 495 782 1155 158 3034 ...

4.1 Age

  • How is the age distribution ?
d_mar_Age <- d_mar %>%
  count(Age) %>%
  rename(count = n)
d_mar_Age
##      Age count
## 1 Middle   508
## 2    Old   205
## 3  Young   287
  • Most of the customers are in the middle age group
fig <- plot_ly(d_mar_Age, labels = ~Age, values = ~count, type = 'pie', hole = 0.4) %>%
  layout(title = 'Age Distribution', xaxis = list(title = 'Age'), yaxis = list(title = 'Count'))
fig
fig <- plot_ly(d_mar_Age, x = ~Age, y = ~count, type = 'bar', marker = list(colorscale = 'Viridis')) %>%
  layout(title = 'Age Distribution', xaxis = list(title = 'Age'), yaxis = list(title = 'Count'))
fig
  • How much is the average salary,amount spent by age?
d_mar_Age_Salary <- d_mar %>% group_by(Age) %>% summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_Age_AmountSpent <- d_mar %>% group_by(Age) %>% summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))

result <- merge(d_mar_Age_Salary, d_mar_Age_AmountSpent, by = "Age")
result
##      Age AVG_Salary AVG_AmountSpent
## 1 Middle   72036.42         1501.69
## 2    Old   56365.85         1432.13
## 3  Young   27715.68          558.62
fig <- subplot(
  plot_ly(result, x = ~Age, y = ~AVG_Salary, type = 'bar', name = 'Mean Salary', marker = list(colorscale = 'fall')),
  plot_ly(result, x = ~Age, y = ~AVG_AmountSpent, type = 'bar', name = 'Mean Amount Spent', marker = list(colorscale = 'fall')),
  nrows = 2
) %>% layout(title = 'Age', xaxis = list(title = 'Age'), yaxis = list(title = 'Value'))
fig
  • Middle age group earns more and spends more
# Plot 1: Mean Salary
p1 <- ggplot(result, aes(x = Age, y = AVG_Salary, fill = AVG_Salary)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colours = rainbow(4)) +
  labs(title = "Age AVG Salary", x = "Age", y = "AVG Salary") +
  theme_minimal()

# Plot 2: Mean Amount Spent
p2 <- ggplot(result, aes(x = Age, y = AVG_AmountSpent, fill = AVG_AmountSpent)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colours = rainbow(4)) +
  labs(title = "Age AVG Amount Spent", x = "Age", y = "AVG Amount Spent") +
  theme_minimal()

# Combine plots
grid.arrange(p1, p2, nrow = 2)

# Scatter plot with trendline
p <- ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Age))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Set1") +
  labs(title = 'Age With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent', color = 'Age')

print(p)
## `geom_smooth()` using formula = 'y ~ x'

Summary

  • Most of the customers are in the middle age group
  • Middle age group earns more and spends more

4.2 Gender

  • How is the gender distribution ?
d_mar_Gender <- d_mar %>%
  count(Gender) %>%
  rename(count = n)
d_mar_Gender
##   Gender count
## 1 Female   506
## 2   Male   494
  • Gender distribution is balanced
fig <- plot_ly(d_mar_Gender, x = ~Gender, y = ~count, type = 'bar', marker = list(color = ~count, colorscale = 'Viridis'), text = ~count, textposition = 'outside') %>%
  layout(title = 'Gender Distribution', xaxis = list(title = 'Gender'), yaxis = list(title = 'Count'))
fig
  • How much is the average salary, amount spent by Gender
# Calculate average salary and amount spent by gender
d_mar_Gender_Salary <- d_mar %>%
  group_by(Gender) %>%
  summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))

d_mar_Gender_AmountSpent <- d_mar %>%
  group_by(Gender) %>%
  summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))

# Combine the two data frames
result <- merge(d_mar_Gender_Salary, d_mar_Gender_AmountSpent, by = "Gender")

# Create individual plots
plot1 <- ggplot(result, aes(x = Gender, y = AVG_Salary, fill = Gender)) + 
  geom_bar(stat = "identity") +
  labs(title = "Gender AVG Salary", x = "Gender", y = "Average Salary") +
  theme_minimal()

plot2 <- ggplot(result, aes(x = Gender, y = AVG_AmountSpent, fill = Gender)) + 
  geom_bar(stat = "identity") +
  labs(title = "Gender AVG Amount Spent", x = "Gender", y = "Average Amount Spent") +
  theme_minimal()

# Combine plots into a single figure with subplots
grid.arrange(plot1, plot2, nrow = 2)

  • Men earn more and spend more
# Scatter plot with trendline
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = Gender)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # Linear model trendline without confidence interval
  scale_color_brewer(palette = "Set1") +    # Color by Gender
  labs(title = 'Gender With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

  • How is the age distribution by gender?
d_mar_G_and_A <- d_mar %>%
  count(Gender, Age) %>%
  rename(count = n)
d_mar_G_and_A
##   Gender    Age count
## 1 Female Middle   206
## 2 Female    Old   129
## 3 Female  Young   171
## 4   Male Middle   302
## 5   Male    Old    76
## 6   Male  Young   116
  • The highest number of customers is middle age men
  • Lowest number of customers older men
d_mar_G_and_A <- d_mar %>%
  group_by(Gender, Age) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
ggplot(d_mar_G_and_A, aes(x = Age, y = count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Age Count With Gender", x = "Age", y = "Count") +
  theme_minimal()

  • How much is the average(min,max) spending amount of customers by gender,age?
d_mar_G_and_A <- d_mar %>%
  group_by(Gender, Age) %>%
  summarise(AVG_AmountSpent = mean(AmountSpent, na.rm = TRUE)) %>%
  mutate(AVG_AmountSpent = round(AVG_AmountSpent, 2))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A
## # A tibble: 6 × 3
## # Groups:   Gender [2]
##   Gender Age    AVG_AmountSpent
##   <chr>  <chr>            <dbl>
## 1 Female Middle           1301.
## 2 Female Old              1279.
## 3 Female Young             501.
## 4 Male   Middle           1638.
## 5 Male   Old              1692.
## 6 Male   Young             643.

Highest average spending:Male Old 1691
Lowest average spending: Female Young 501

# Calculating various statistics by Gender and Age
d_mar_G_and_A_AVG <- d_mar %>% 
  group_by(Gender, Age) %>% 
  summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A_Max <- d_mar %>% 
  group_by(Gender, Age) %>% 
  summarise(Max_AmountSpent = max(AmountSpent, na.rm = TRUE))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A_Min <- d_mar %>% 
  group_by(Gender, Age) %>% 
  summarise(Min_AmountSpent = min(AmountSpent, na.rm = TRUE))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A_Count <- d_mar %>% 
  group_by(Gender, Age) %>% 
  summarise(Count = n())
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
# Merging the data frames
result <- reduce(list(d_mar_G_and_A_AVG, d_mar_G_and_A_Max, d_mar_G_and_A_Min, d_mar_G_and_A_Count), full_join, by = c("Gender", "Age"))
result
## # A tibble: 6 × 6
## # Groups:   Gender [2]
##   Gender Age    AVG_AmountSpent Max_AmountSpent Min_AmountSpent Count
##   <chr>  <chr>            <dbl>           <int>           <int> <int>
## 1 Female Middle           1301.            5830             158   206
## 2 Female Old              1279.            5564              65   129
## 3 Female Young             501.            3688              47   171
## 4 Male   Middle           1638.            5878             157   302
## 5 Male   Old              1692.            6217             297    76
## 6 Male   Young             643.            1692              38   116
# Reshape the data from wide to long format for faceting
long_result <- result %>%
  gather(key = "Statistic", value = "Value", AVG_AmountSpent, Min_AmountSpent, Max_AmountSpent, Count)

# Create the plot
ggplot(long_result, aes(x = paste(Gender, Age), y = Value, fill = Statistic)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(~ Statistic, scales = "free_y", ncol = 1) +
  theme_minimal() +
  labs(title = "Gender Age With Amount Spent", x = "Gender and Age", y = "Value") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Summary

  • Gender distribution is balanced

  • Men earn more and spend more

  • The highest number of customers is middle age men

  • Lowest number of customers older men

  • Highest average spending:Male Old 1691

  • Lowest average spending: Female Young 501

  • Highest average Salary:Male Middle 76.3 k

  • Lowest average Salary: Female Young 25.5 k

4.3 OwnHome

  • How is the OwnHome distribution ?
d_mar_OwnHome <- d_mar %>%
  count(OwnHome) %>%
  rename(count = n)
d_mar_OwnHome
##   OwnHome count
## 1     Own   516
## 2    Rent   484
# Assuming d_mar_OwnHome is already created and contains 'OwnHome' and 'count' columns
fig <- plot_ly(d_mar_OwnHome, labels = ~OwnHome, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan'))) %>%
  layout(title = 'Own Home Count')
fig
  • OwnHome distribution is balanced

  • What are the customers’ average income and expenses by OwnHome?

# Calculate average salary and amount spent by OwnHome
d_mar_OwnHome_Salary <- d_mar %>%
  group_by(OwnHome) %>%
  summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))

d_mar_OwnHome_AmountSpent <- d_mar %>%
  group_by(OwnHome) %>%
  summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))

# Combine the data
result <- merge(d_mar_OwnHome_Salary, d_mar_OwnHome_AmountSpent, by = "OwnHome")

# Plotting using ggplot2 and gridExtra
plot1 <- ggplot(result, aes(x = OwnHome, y = AVG_Salary, fill = OwnHome)) + 
  geom_bar(stat = "identity") +
  labs(title = "Own Home AVG Salary", x = "Own Home", y = "Average Salary") +
  theme_minimal()

plot2 <- ggplot(result, aes(x = OwnHome, y = AVG_AmountSpent, fill = OwnHome)) + 
  geom_bar(stat = "identity") +
  labs(title = "Own Home AVG Amount Spent", x = "Own Home", y = "Average Amount Spent") +
  theme_minimal()

library(gridExtra)
grid.arrange(plot1, plot2, nrow = 2)

  • Homeowners earn more and spend more
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = OwnHome)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # Linear model trendline
  scale_color_brewer(palette = "Set1") +    # Color by OwnHome
  labs(title = 'Own Home With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Summary

  • OwnHome distribution is balanced
  • Homeowners earn more and spend more

4.4 Married

  • How is the Married distribution ?
d_mar_Married <- d_mar %>%
  count(Married) %>%
  rename(count = n)

fig <- plot_ly(d_mar_Married, labels = ~Married, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan'))) %>%
  layout(title = 'Married Count')
fig
  • Married distribution is balanced
# Calculating average salary and amount spent by Married status
d_mar_Married_Salary <- d_mar %>%
  group_by(Married) %>%
  summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))

d_mar_Married_AmountSpent <- d_mar %>%
  group_by(Married) %>%
  summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))

# Combine the data
result <- merge(d_mar_Married_Salary, d_mar_Married_AmountSpent, by = "Married")

# Plotting using ggplot2 and gridExtra
plot1 <- ggplot(result, aes(x = Married, y = AVG_Salary, fill = Married)) + 
  geom_bar(stat = "identity") +
  labs(title = "Married AVG Salary", x = "Married", y = "Average Salary") +
  theme_minimal()

plot2 <- ggplot(result, aes(x = Married, y = AVG_AmountSpent, fill = Married)) + 
  geom_bar(stat = "identity") +
  labs(title = "Married AVG Amount Spent", x = "Married", y = "Average Amount Spent") +
  theme_minimal()

grid.arrange(plot1, plot2, nrow = 2)

  • Married people earn more and spend more
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = Married)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # Linear model trendline
  scale_color_brewer(palette = "Set1") +    # Color by Married status
  labs(title = 'Married With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Summary

  • Married distribution is balanced
  • Married people earn more and spend more

4.5 Location

How is the Location distribution ?

d_mar_Location <- d_mar %>%
  count(Location) %>%
  rename(count = n)

ggplot(d_mar_Location, aes(x = Location, y = count, fill = count)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_c() +
  labs(title = "Location Distribution", x = "Location", y = "Count") +
  theme_minimal() +
  geom_text(aes(label = count), vjust = -0.5)

# Boxplot for Salary by Location
plot_salary <- ggplot(d_mar, aes(x = Location, y = Salary)) +
  geom_boxplot() +
  labs(title = "Salary Distribution by Location", x = "Location", y = "Salary") +
  theme_minimal()

# Boxplot for AmountSpent by Location
plot_amount_spent <- ggplot(d_mar, aes(x = Location, y = AmountSpent)) +
  geom_boxplot() +
  labs(title = "Amount Spent Distribution by Location", x = "Location", y = "Amount Spent") +
  theme_minimal()

# Arrange the plots
grid.arrange(plot_salary, plot_amount_spent, nrow = 2)

ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = Location)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # Linear model trendline
  scale_color_brewer(palette = "Set1") +    # Color by Location
  labs(title = 'Location With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Summary

  • Most customers are close to the nearest physical store that sells similar products
  • Customers who are close to the nearest physical store selling similar products have lower spend, although their income is higher

4.6 Children

  • How is the Children distribution ?
d_mar_Children <- d_mar %>%
  count(Children) %>%
  rename(count = n)
d_mar_Children
##   Children count
## 1        0   462
## 2        1   267
## 3        2   146
## 4        3   125
fig <- plot_ly(d_mar_Children, labels = ~Children, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan', 'CadetBlue', 'DarkSeaGreen'))) %>%
  layout(title = 'Children Count')
fig
fig <- plot_ly(d_mar_Children, x = ~Children, y = ~count, type = 'scatter', mode = 'markers', marker = list(color = ~count, size = ~count * 0.1, showscale = TRUE)) %>%
  layout(title = 'Children Distribution', xaxis = list(title = 'Children Count'), yaxis = list(title = 'Number Of Customers'))
fig
  • 46 percent of customers don’t have Children

  • What are the customers’ average income and expenses by Children?

# Calculating average salary and amount spent by Children
d_mar_Children_Salary <- d_mar %>%
  group_by(Children) %>%
  summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))

d_mar_Children_AmountSpent <- d_mar %>%
  group_by(Children) %>%
  summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))

# Combine the data
result <- merge(d_mar_Children_Salary, d_mar_Children_AmountSpent, by = "Children")

# Plotting using ggplot2 and gridExtra
plot_salary <- ggplot(result, aes(x = Children, y = AVG_Salary)) +
  geom_bar(stat = "identity", aes(fill = Children)) +
  labs(title = "Children AVG Salary", x = "Number of Children", y = "Average Salary") +
  theme_minimal()

plot_amount_spent <- ggplot(result, aes(x = Children, y = AVG_AmountSpent)) +
  geom_bar(stat = "identity", aes(fill = Children)) +
  labs(title = "Children AVG Amount Spent", x = "Number of Children", y = "Average Amount Spent") +
  theme_minimal()

# Arrange the plots
grid.arrange(plot_salary, plot_amount_spent, nrow = 2)

  • Although customers’ incomes were close, there was a decrease in spending as the number of children increased
# Scatter plot with trendline for Salary vs Amount Spent by Children
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Children))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # Linear model trendline
  scale_color_brewer(palette = "Set1") +    # Color by number of Children
  labs(title = 'Children With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Summary

  • 46 percent of customers don’t have Children
  • Although customers’ incomes were close, there was a decrease in spending as the number of children increased

4.7 History

  • How is the History distribution ?
# Calculate average salary and amount spent by History
d_mar_History_Salary <- d_mar %>%
  group_by(History) %>%
  summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))

d_mar_History_AmountSpent <- d_mar %>%
  group_by(History) %>%
  summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))

# Combine the data
result <- merge(d_mar_History_Salary, d_mar_History_AmountSpent, by = "History")

# Plotting using ggplot2 and gridExtra
plot_salary <- ggplot(result, aes(x = History, y = AVG_Salary, fill = History)) +
  geom_bar(stat = "identity") +
  labs(title = "History AVG Salary", x = "History", y = "Average Salary") +
  theme_minimal()

plot_amount_spent <- ggplot(result, aes(x = History, y = AVG_AmountSpent, fill = History)) +
  geom_bar(stat = "identity") +
  labs(title = "History AVG Amount Spent", x = "History", y = "Average Amount Spent") +
  theme_minimal()

grid.arrange(plot_salary, plot_amount_spent, nrow = 2)

  • Customers with high previous purchasing volume are the group with the highest income and expenditure
# Scatter plot with trendline for Salary vs Amount Spent by History
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = History)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # Linear model trendline
  scale_color_brewer(palette = "Set1") +    # Color by History
  labs(title = 'History With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Summary

  • The highest number of customers who prefer us for the first time
  • Customers with high previous purchasing volume are the group with the highest income and expenditure

4.8 Catalogs

  • How is the Catalogs distribution ?
d_mar_Catalogs <- d_mar %>%
  count(Catalogs) %>%
  rename(count = n)
d_mar_Catalogs
##   Catalogs count
## 1        6   252
## 2       12   282
## 3       18   233
## 4       24   233
fig <- plot_ly(d_mar_Catalogs, labels = ~Catalogs, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan', 'CadetBlue', 'DarkSeaGreen'))) %>%
  layout(title = 'Catalogs Count')
fig
  • Catalogs distribution is balanced
# Calculating average salary and amount spent by Catalogs
d_mar_Catalogs_Salary <- d_mar %>%
  group_by(Catalogs) %>%
  summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))

d_mar_Catalogs_AmountSpent <- d_mar %>%
  group_by(Catalogs) %>%
  summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))

# Combine the data
result <- merge(d_mar_Catalogs_Salary, d_mar_Catalogs_AmountSpent, by = "Catalogs")

# Plotting using ggplot2 and gridExtra
library(ggplot2)
library(gridExtra)

plot_salary <- ggplot(result, aes(x = Catalogs, y = AVG_Salary)) +
  geom_bar(stat = "identity", aes(fill = Catalogs)) +
  labs(title = "Catalogs AVG Salary", x = "Catalogs", y = "Average Salary") +
  theme_minimal()

plot_amount_spent <- ggplot(result, aes(x = Catalogs, y = AVG_AmountSpent)) +
  geom_bar(stat = "identity", aes(fill = Catalogs)) +
  labs(title = "Catalogs AVG Amount Spent", x = "Catalogs", y = "Average Amount Spent") +
  theme_minimal()

grid.arrange(plot_salary, plot_amount_spent, nrow = 2)

  • As the number of catalogs sent increases, so does the expenditure and income
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Catalogs))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # Linear model trendline
  scale_color_brewer(palette = "Set1") +    # Color by Catalogs
  labs(title = 'Catalogs With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Summary

  • Catalogs distribution is balanced
  • As the number of catalogs sent increases, so does the expenditure and income

4.9 AmountSpent

  • How is the AmountSpent distribution ?
ggplot(d_mar, aes(x = AmountSpent)) +
  geom_histogram(binwidth = 200, fill = "red", color = "black") +
  scale_x_continuous(limits = c(0, 6000), breaks = seq(0, 6000, by = 200)) +
  labs(title = "Distribution Of Amount Spent", x = "Amount Spent", y = "Counts") +
  theme_minimal()
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

ggplot(d_mar, aes(x = "", y = AmountSpent)) +
  geom_boxplot(fill = "royalblue", color = "black") +
  stat_summary(fun.data = "mean_sdl", geom = "errorbar", color = "red", width = 0.5) +
  labs(title = "Amount Spent Distribution", y = "Amount Spent") +
  theme_minimal()

ggplot(d_mar, aes(x = factor(1), y = AmountSpent)) +
  geom_violin(fill = "lightseagreen", color = "black", alpha = 0.6) +
  geom_boxplot(width = 0.1, fill = "white", color = "black") +
  labs(title = "Amount Spent Distribution", y = "Amount Spent") +
  theme_minimal() +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())

Summary

  • Amount Spent max:6217
  • Amount Spent mean:1216
  • Amount Spent median:962
  • Amount Spent min:38

4.10 Salary

  • How is the Salary distribution ?
ggplot(d_mar, aes(x = Salary)) +
  geom_histogram(binwidth = 5000, fill = "red", color = "black") +
  scale_x_continuous(limits = c(0, 150000), breaks = seq(0, 150000, by = 5000)) +
  labs(title = "Distribution Of Salary", x = "Salary", y = "Counts") +
  theme_minimal()
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

ggplot(d_mar, aes(x = factor(1), y = Salary)) +
  geom_boxplot(fill = "royalblue", color = "black") +
  stat_summary(fun.data = "mean_sdl", geom = "errorbar", color = "red", width = 0.5) +
  labs(title = "Salary Distribution") +
  theme_minimal() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

ggplot(d_mar, aes(x = factor(1), y = Salary)) +
  geom_violin(fill = "lightseagreen", color = "black", alpha = 0.6) +
  geom_boxplot(width = 0.1, fill = "white", color = "black") +
  labs(title = "Salary Distribution") +
  theme_minimal() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Summary

  • Salary max:168.8 k
  • Salary mean:56.1 k
  • Salary median:53.7 k
  • Salary min:10.1 k

5. Data summary

  • Data has only object and integer values.

  • Dataset comprises of 1000 observations and 10 characteristics.

  • We don’t have duplicated data

  • We have already said in the data set that null values are customers who have not exchanged with us in the past, we do not have missing data.

  • Most of the customers are in the middle age group

  • Middle age group earns more and spends more

  • Gender distribution is balanced

  • Men earn more and spend more

  • The highest number of customers is middle age men

  • Lowest number of customers older men

  • Highest average spending:Male Old 1691

  • Lowest average spending: Female Young 501

  • Highest average Salary:Male Middle 76.3 k

  • Lowest average Salary: Female Young 25.5 k

  • OwnHome distribution is balanced

  • Homeowners earn more and spend more

  • Married distribution is balanced

  • Married people earn more and spend more

  • Most customers are close to the nearest physical store that sells similar products

  • Customers who are close to the nearest physical store selling similar products have lower spend , although their income is higher

  • 46 percent of customers don’t have Children

  • Although customers’ incomes were close, there was a decrease in spending as the number of children increased

  • The highest number of customers who prefer us for the first time

  • Customers with high previous purchasing volume are the group with the highest income and expenditure

  • Catalogs distribution is balanced

  • As the number of catalogs sent increases, so does the expenditure and income

  • Amount Spent max:6217

  • Amount Spent mean:1216

  • Amount Spent median:962

  • Amount Spent min:38

  • Salary max:168.8 k

  • Salary mean:56.1 k

  • Salary median:53.7 k

  • Salary min:10.1 k

6. Some regressions

In this part, we will come back to d_dimar dataframe containing 18 columns, and the last 8 columns is used for linear regression. Firstly, we will remind you about d_dimar dataframe:

head(d_dimar)
##      Age Gender OwnHome Married Location Salary Children History Catalogs
## 1    Old Female     Own  Single      Far  47500        0    High        6
## 2 Middle   Male    Rent  Single    Close  63600        0    High        6
## 3  Young Female    Rent  Single    Close  13500        0     Low       18
## 4 Middle   Male     Own Married    Close  85600        1    High       18
## 5 Middle Female     Own  Single    Close  68400        0    High       12
## 6  Young   Male     Own Married    Close  30400        0     Low        6
##   AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
## 1         755        1         0          0         1     0     0      0      1
## 2        1318        0         0          1         0     0     1      0      1
## 3         296        1         0          1         0     1     0      0      0
## 4        2436        0         1          1         1     0     1      0      1
## 5        1304        1         0          1         1     0     1      0      1
## 6         495        0         1          1         1     1     0      0      0

Now, let’s see again the correlation matrix between quantitative attributes:

# Melt the correlation matrix for ggplot
melted_corr_matrix <- melt(corr_matrix)

# Create the heatmap plot and assign it to a variable
heatmap_plot <- ggplot(data = melted_corr_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
        axis.text.y = element_text(size = 12)) +
  labs(x = '', y = '', title = 'Correlation Matrix') +
  geom_text(aes(label = sprintf("%.2f", value)), vjust = 1, size = 3)

# Print the plot with larger dimensions
print(heatmap_plot)

# Save the plot with larger dimensions
ggsave("heatmap_plot.png", plot = heatmap_plot, width = 10, height = 8, dpi = 300)

6.1. Simple linear regression

6.1.1. AmountSpent ~ Salary

par(mfrow=c(2,1))
plot(density(d_mar$AmountSpent), main="Density-AmountSpent", xlab="Amount Spent")
plot(density(d_mar$Salary), main="Density-Salary", xlab="Salary")

We would expect a linear relation in Amount spent and Salary of customers. Let us see if it is in fact true.

Now consider the familiar model:

\(AmountSpent = \beta_0 + \beta_1 * Salary\)

fit <- lm(AmountSpent ~ Salary, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ Salary, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2179.7  -315.2   -53.5   279.7  3752.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.31783   45.37416  -0.338    0.736    
## Salary        0.02196    0.00071  30.930   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 687.1 on 998 degrees of freedom
## Multiple R-squared:  0.4894, Adjusted R-squared:  0.4889 
## F-statistic: 956.7 on 1 and 998 DF,  p-value: < 2.2e-16

The intercept could be removed since it is not significant from the test

fit <- lm(AmountSpent ~ 0 + Salary, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ 0 + Salary, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2159.5  -322.0   -60.8   277.5  3761.4 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## Salary 0.0217504  0.0003398      64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 686.8 on 999 degrees of freedom
## Multiple R-squared:  0.8039, Adjusted R-squared:  0.8038 
## F-statistic:  4097 on 1 and 999 DF,  p-value: < 2.2e-16

The second model, which forces the line through the origin (no intercept), suggests a stronger relationship between Salary and AmountSpent, evidenced by a higher R-squared value (0.8039 vs. 0.4894). This could indicate that the true relationship may indeed pass through the origin, or it may be an artifact of this particular dataset. However, excluding the intercept can sometimes lead to misleading interpretations, and it’s crucial to consider whether it makes theoretical sense for the intercept to be zero in the context of the data.

6.1.2. AmountSpent ~ Children

# Scatter plot with trendline
p <- ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Age))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Set1") +
  labs(title = 'Age With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent', color = 'Age')

print(p)
## `geom_smooth()` using formula = 'y ~ x'

fit <- lm(AmountSpent ~ Children, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ Children, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1300.6  -669.7  -253.4   422.4  4810.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1406.63      39.67  35.460  < 2e-16 ***
## Children     -203.27      28.22  -7.203 1.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 937.5 on 998 degrees of freedom
## Multiple R-squared:  0.04942,    Adjusted R-squared:  0.04847 
## F-statistic: 51.89 on 1 and 998 DF,  p-value: 1.157e-12
fit <- lm(AmountSpent ~ 0 + Salary, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ 0 + Salary, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2159.5  -322.0   -60.8   277.5  3761.4 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## Salary 0.0217504  0.0003398      64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 686.8 on 999 degrees of freedom
## Multiple R-squared:  0.8039, Adjusted R-squared:  0.8038 
## F-statistic:  4097 on 1 and 999 DF,  p-value: < 2.2e-16

Model 1, predicting Amount Spent based on the number of Children, has a low R-squared, indicating that Children alone poorly predict spending.

Model 2, using Salary without an intercept, shows a much higher R-squared, suggesting Salary is a strong predictor of spending.

The significant negative coefficient for Children in Model 1 indicates that as the number of children increases, the amount spent decreases. However, the explanatory power of Salary on Amount Spent is far greater than that of the number of Children, as seen in the difference in R-squared values (0.8039 vs. 0.04942).

6.2. Multiple linear regressions

6.2.1. AmountSpent ~ History (Medium, High)

First, we will plot the density of amount spent based on the customer history.

ggplot(d_mar, aes(x=AmountSpent)) + geom_density(aes(group=History, fill=History), alpha=.3)

We can observe and infer a couple of things here. Firstly the obvious - customers with a High History tend to spend more and those with a low history tend to spend low. Secondly, the ones whoch we thought did not have a history also have a record of spending some amount. Thus, our assumption is incorrect. However, we will keep Never as a category and build models on it since we do not have any detailed information about the predictor. In a practical scenario, we’d go back and get more information about how this data as classified while storing it.

Next, we’ll look at the amount spend based on the age of customers.

fit <- lm(AmountSpent ~ Hist_m+Hist_h, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ Hist_m + Hist_h, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1621.1  -501.3  -219.2   315.9  4125.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   858.95      33.60  25.566   <2e-16 ***
## Hist_m         91.45      62.98   1.452    0.147    
## Hist_h       1327.19      59.06  22.472   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 775.6 on 997 degrees of freedom
## Multiple R-squared:  0.3499, Adjusted R-squared:  0.3486 
## F-statistic: 268.4 on 2 and 997 DF,  p-value: < 2.2e-16
fit <- lm(AmountSpent ~0+ Hist_m+Hist_h, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ 0 + Hist_m + Hist_h, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1621.1   -59.2   316.5   747.4  4984.0 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## Hist_m   950.40      68.51   13.87   <2e-16 ***
## Hist_h  2186.14      62.47   35.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 997.5 on 998 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.586 
## F-statistic: 708.6 on 2 and 998 DF,  p-value: < 2.2e-16

Model 1, with an intercept, explains 35% of the variance in Amount Spent using customer history categories (medium and high), but the medium history variable is not a significant predictor. Model 2, without an intercept, attributes all of the Amount Spent variance to history categories and explains a higher variance (58.68%), with both history categories as significant predictors. The higher R-squared in Model 2 suggests that when we force the regression through the origin, the history categories alone account for more of the variance in spending, indicating their strong individual contributions to the model. However, caution is needed as omitting the intercept can lead to misestimation of effects.

6.2.2. AmountSpent ~ Age (Young, Middle, Old)

ggplot(d_mar, aes(x=AmountSpent)) + geom_density(aes(group=Age, fill=Age), alpha=.3)

fit <- lm(AmountSpent ~ Age_m+Age_y, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ Age_m + Age_y, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1367.1  -546.1  -152.1   390.6  4784.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1432.13      60.48  23.678   <2e-16 ***
## Age_m          69.56      71.65   0.971    0.332    
## Age_y        -873.50      79.19 -11.030   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 866 on 997 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1881 
## F-statistic: 116.7 on 2 and 997 DF,  p-value: < 2.2e-16
fit <- lm(AmountSpent ~ 0+ Age_m+Age_y, data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ 0 + Age_m + Age_y, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1344.7  -368.9    55.9   780.1  6217.0 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## Age_m  1501.69      48.00  31.285   <2e-16 ***
## Age_y   558.62      63.86   8.747   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1082 on 998 degrees of freedom
## Multiple R-squared:  0.5139, Adjusted R-squared:  0.513 
## F-statistic: 527.6 on 2 and 998 DF,  p-value: < 2.2e-16

Model 1 indicates that the age category ‘young’ significantly predicts Amount Spent negatively, while ‘middle’ age has no significant effect. The model explains 18.97% of the variance in spending. Model 2, without an intercept, finds both age categories to be significant predictors, with ‘middle’ age having a positive association with spending. This model accounts for a higher variance in spending (51.39%). The absence of an intercept suggests that Age alone is believed to explain all variations in spending, which may not be realistic, but it indicates the strong influence of age categories on spending behavior.

6.2.3. AmountSpent ~ Age, Salary, OwnHome, Gender

The next plot shows the distribution of salaries of customers based on gender and age.

ggplot(data=d_mar, aes(x=Age, y=mean(Salary))) + geom_col() + facet_grid(vars(OwnHome), vars(Gender)) +  ylab("Avg. Salary")

fit <- lm(AmountSpent ~ Age_m+Age_y + Salary + Ownhome_b + Gender_b , data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ Age_m + Age_y + Salary + Ownhome_b + 
##     Gender_b, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2437.7  -340.8   -49.6   277.6  3494.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.220e+02  8.234e+01   2.696  0.00713 ** 
## Age_m       -2.879e+02  5.896e+01  -4.884 1.21e-06 ***
## Age_y       -2.340e+02  7.098e+01  -3.296  0.00102 ** 
## Salary       2.195e-02  9.537e-04  23.016  < 2e-16 ***
## Ownhome_b    2.227e+01  5.093e+01   0.437  0.66201    
## Gender_b    -6.897e+01  4.489e+01  -1.536  0.12475    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 679.1 on 994 degrees of freedom
## Multiple R-squared:  0.5033, Adjusted R-squared:  0.5008 
## F-statistic: 201.4 on 5 and 994 DF,  p-value: < 2.2e-16
fit <- lm(AmountSpent ~ 0+ Age_m+Age_y + Salary + Ownhome_b + Gender_b , data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ 0 + Age_m + Age_y + Salary + Ownhome_b + 
##     Gender_b, data = d_dimar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2510.3  -297.4   -29.8   293.9  3507.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## Age_m     -2.224e+02  5.389e+01  -4.127 3.98e-05 ***
## Age_y     -9.366e+01  4.843e+01  -1.934   0.0534 .  
## Salary     2.348e-02  7.682e-04  30.570  < 2e-16 ***
## Ownhome_b  5.815e+01  4.932e+01   1.179   0.2386    
## Gender_b  -1.193e+01  3.972e+01  -0.300   0.7639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 681.2 on 995 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8069 
## F-statistic: 836.8 on 5 and 995 DF,  p-value: < 2.2e-16

Model 1, which includes an intercept, reveals that Age (middle and young) and Salary significantly influence Amount Spent. Owning a home and gender (binary) are not significant predictors. This model explains about 50.33% of the variance in spending. Model 2, without an intercept, suggests a higher explanatory power (80.79%) with all variables directly influencing Amount Spent. However, only Age (middle) and Salary remain significant predictors. The substantial increase in R-squared in Model 2 might be due to the absence of an intercept, forcing the model to attribute all variance to the included predictors, which can lead to overestimation of their effects.

6.2.4. AmountSpent ~ All remaining attributes

fit <- lm(AmountSpent ~ Married_b+Location_b+Age_m+Age_y + Salary + Ownhome_b + Gender_b + Hist_m + Hist_h + Hist_m , data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ Married_b + Location_b + Age_m + Age_y + 
##     Salary + Ownhome_b + Gender_b + Hist_m + Hist_h + Hist_m, 
##     data = d_dimar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2215.05  -335.47   -36.96   255.35  3059.31 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.065e+02  8.151e+01   7.440 2.18e-13 ***
## Married_b   -2.117e+01  5.492e+01  -0.386 0.699938    
## Location_b  -4.849e+02  4.346e+01 -11.156  < 2e-16 ***
## Age_m       -1.929e+02  5.510e+01  -3.500 0.000485 ***
## Age_y       -1.966e+02  6.465e+01  -3.042 0.002416 ** 
## Salary       1.886e-02  1.231e-03  15.316  < 2e-16 ***
## Ownhome_b    2.093e+01  4.511e+01   0.464 0.642732    
## Gender_b    -5.180e+01  3.988e+01  -1.299 0.194256    
## Hist_m      -1.451e+02  5.101e+01  -2.845 0.004538 ** 
## Hist_h       4.223e+02  5.796e+01   7.286 6.49e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 599.7 on 990 degrees of freedom
## Multiple R-squared:  0.6142, Adjusted R-squared:  0.6107 
## F-statistic: 175.1 on 9 and 990 DF,  p-value: < 2.2e-16
fit <- lm(AmountSpent ~ 0+ Married_b+Location_b+Age_m+Age_y + Salary + Ownhome_b + Gender_b + Hist_m + Hist_h + Hist_m , data = d_dimar)
summary(fit)
## 
## Call:
## lm(formula = AmountSpent ~ 0 + Married_b + Location_b + Age_m + 
##     Age_y + Salary + Ownhome_b + Gender_b + Hist_m + Hist_h + 
##     Hist_m, data = d_dimar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2441.63  -302.17    22.27   302.10  3152.37 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Married_b  -7.275e+01  5.596e+01  -1.300  0.19385    
## Location_b -3.712e+02  4.179e+01  -8.882  < 2e-16 ***
## Age_m      -5.594e+01  5.334e+01  -1.049  0.29454    
## Age_y       1.403e+02  4.738e+01   2.962  0.00313 ** 
## Salary      2.254e-02  1.158e-03  19.470  < 2e-16 ***
## Ownhome_b   9.737e+01  4.511e+01   2.158  0.03114 *  
## Gender_b    8.014e+01  3.668e+01   2.185  0.02914 *  
## Hist_m     -4.537e+01  5.055e+01  -0.898  0.36958    
## Hist_h      4.949e+02  5.867e+01   8.435  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 615.9 on 991 degrees of freedom
## Multiple R-squared:  0.8436, Adjusted R-squared:  0.8422 
## F-statistic: 593.8 on 9 and 991 DF,  p-value: < 2.2e-16

Model 1, with an intercept, shows that Location, Age, and Salary are significant predictors of Amount Spent, with Location having a notably negative effect. The model explains 61.42% of the variance in spending. In contrast, Model 2, omitting the intercept, attributes more variance (84.36%) to the included predictors and identifies additional significant variables like Ownhome and Gender. The increase in R-squared in Model 2 suggests that when all variance is attributed to the predictors, their effects are overestimated. Model 1’s approach is more conservative and potentially more realistic, acknowledging other unaccounted factors influencing spending.

7. Conclusion

  • The data is great for linear regression exercises but the encoded attributes is not good enough, and there are some missing values in History attributes.
  • Data analysis is the most interesting part when we could get the insights of the data set
  • Some simple resgression are built but the results are quite good.
  • Some machine learning algorithms could be used in further works like: KNN, SVR, Random Forests,…